-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Search query phase coordinator duration APM metric. #136059
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Search query phase coordinator duration APM metric. #136059
Conversation
to track the duration of the search query phase.
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments, thanks!
|
||
@Override | ||
protected void doRun(Map<SearchShardIterator, Integer> shardIndexMap) { | ||
phaseStartTimeNanos = System.nanoTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be potentially streamlined into the run
method in the parent class. We may be able to even report the latency at the coordinator in a generic manner, with all the code in AbstractSearchAsyncAction
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My other PR attempted the "generic" path as well. There is some weirdness around the PIT creation and queries that caused a lot of issues in CI but I think I have an idea of where the issue was. I'll try and move this code into the AbstractSearchAsyncAction
and that should cover at least DFS and query phases.
I'll have to check if that helps fetch other subsequent phases. The issue with them is that they don't subclass off of AbstractSearchAsyncAction
but reference it via a passed in context
so those phases might not hit run
to set the start time of the phase. I'll have to do some debug tracing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean. I'd limit the change to tracking the two variations of query phase and perhaps open point in time. The AbstractSearchAsyncAction
subclasses to be more concrete. I would not consider the other so called search phases that important to be honest. I care about can match, dfs, query, fetch. We can expand further later as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, does "dfs" cover both the "DFS roundtrip" and the DFS specific query implementation (i.e. basically timing SearchDfsQueryThenFetchAction
)? Or do we want "DFS roundtrip", "normal query", "DFS specific query" in addition to can match and fetch times?
request.getMaxConcurrentShardRequests(), | ||
clusters | ||
clusters, | ||
coordinatorSearchPhaseAPMMetrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit surprised that we don't record latency for this. I don't want to confuse you , I don't mean reporting dfs phase latency at the coordinator. What I mean is that DFS query then fetch has an additional DFS roundtrip in the beginning, but after DFS it executes the query phase, yet the codepath is all in SearchDfsQueryThenFetchAction
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah its a separate code path if you do a DFS query. It joins the "normal" code path when you get back to the Fetch and subsequent phases. In my original PR, it was reporting a "dfs" and a "dfs_query" phase duration. Not for this PR but for the future one where we record the DFS phase metric, do we want to record a separate DFS roundtrip metric? Also, do we want to differentiate the two code query code paths with two different metrics (DFS and non-DFS query phases) or record both paths with the same query phase metric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question, I'd keep the two variations of query phase separate.
server/src/main/java/org/elasticsearch/index/search/stats/CoordinatorSearchPhaseAPMMetrics.java
Outdated
Show resolved
Hide resolved
...st/java/org/elasticsearch/search/TelemetryMetrics/CoordinatorSearchPhaseAPMMetricsTests.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/search/stats/CoordinatorSearchPhaseAPMMetrics.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/search/AbstractSearchAsyncAction.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/action/search/AbstractSearchAsyncAction.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some code structure comments
Added es.search.coordinator.phases.query.duration.histogram APM metric to track the duration of the search query phase at the coordinator level..